Analysis of Data Analyst Job Openings

As a former technical recruiter, I was interested in taking a look at the job market for data professionals amidst the pandemic. Here we'll be looking at Data Analyst positions specifically.

You can use this analysis to get an idea about what companies and industries in general have been actively hiring even during the pandemic and where you should focus your job search depending on various factors such as location, company size, etc.

This analysis can also be useful to recruiters, especially staffing agencies, when it comes to finding out information about their competitors and also scoping out new clients.

All of my data comes from Kaggle and you can find the dataset here: https://www.kaggle.com/andrewmvd/data-analyst-jobs

Importing/Cleaning Data

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#using plotly offline
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
import plotly.graph_objs as go
import plotly.express as px
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')
In [2]:
df = pd.read_csv('DataAnalyst.csv')
df.head()
Out[2]:
Unnamed: 0 Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice\n3.2 New York, NY New York, NY 201 to 500 employees 1961 Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD) -1 True
1 1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York\n3.8 New York, NY New York, NY 10000+ employees 1893 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) -1 -1
2 2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace\n3.4 New York, NY New York, NY 1001 to 5000 employees 2003 Company - Private Internet Information Technology Unknown / Non-Applicable GoDaddy -1
3 3 Data Analyst $37K-$66K (Glassdoor est.) Requisition NumberRR-0001939\nRemote:Yes\nWe c... 4.1 Celerity\n4.1 New York, NY McLean, VA 201 to 500 employees 2002 Subsidiary or Business Segment IT Services Information Technology $50 to $100 million (USD) -1 -1
4 4 Reporting Data Analyst $37K-$66K (Glassdoor est.) ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... 3.9 FanDuel\n3.9 New York, NY New York, NY 501 to 1000 employees 2009 Company - Private Sports & Recreation Arts, Entertainment & Recreation $100 to $500 million (USD) DraftKings True
In [3]:
#dropping columns I won't be using 
df = df.drop(['Unnamed: 0', 'Founded', 'Competitors', 'Easy Apply'],  axis=1)
In [4]:
#taking away the rating from the end of each company name 
df['Company Name'] = df['Company Name'].map(lambda x: str(x)[:-4])
In [5]:
#making location names just the State Abbreviation rather than city, state
df['Location'] = df['Location'].map(lambda x: str(x)[-2:])
In [6]:
#making Salary Estimates just the range and taking away any extra info 
df['Salary Estimate'] = df['Salary Estimate'].map(lambda x: str(x)[:-16])
In [7]:
#this dataset sets nan values as -1 instead of nan so I am switching this to nan values and then dropping them 
df=df.replace(-1,np.nan)
df=df.replace(-1.0,np.nan)
df=df.replace('-1',np.nan)
In [8]:
#dropping rows containing with any values that are NA
df = df.dropna()
In [9]:
#final dataframe 
df.head()
Out[9]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Type of ownership Industry Sector Revenue
0 Data Analyst, Center on Immigration and Justic... $37K-$66K Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice NY New York, NY 201 to 500 employees Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD)
1 Quality Data Analyst $37K-$66K Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York NY New York, NY 10000+ employees Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD)
2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace NY New York, NY 1001 to 5000 employees Company - Private Internet Information Technology Unknown / Non-Applicable
3 Data Analyst $37K-$66K Requisition NumberRR-0001939\nRemote:Yes\nWe c... 4.1 Celerity NY McLean, VA 201 to 500 employees Subsidiary or Business Segment IT Services Information Technology $50 to $100 million (USD)
4 Reporting Data Analyst $37K-$66K ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... 3.9 FanDuel NY New York, NY 501 to 1000 employees Company - Private Sports & Recreation Arts, Entertainment & Recreation $100 to $500 million (USD)

Data Analysis

Job Titles of Openings

In [10]:
#total number of rows 
df['Job Title'].count()
Out[10]:
1855
In [11]:
from wordcloud import WordCloud

def wordcloud(column):
    title = df[column].tolist()
    wordcloud = WordCloud(max_font_size=50, max_words=75, background_color="white").generate(' '.join(title))
    plt.figure(figsize=(10,12))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(str(column) + ' Word Cloud \n')
    plt.axis("off")
    plt.show()

wordcloud('Job Title')
wordcloud('Job Description')

These would be some good keywords to use when searching for data analyst positions across job boards and also that might be good to add to your resume if looking for a data analyst position.

In [12]:
job_title = df['Job Title'].value_counts()
title_count = pd.DataFrame(data=job_title).reset_index()
title_count.rename(columns={'index':'Job Title', 'Job Title':'Total Openings'}, inplace=True)
title_count_head = title_count.head(10)


fig = px.bar(title_count_head, x='Job Title', y='Total Openings', color='Job Title', text='Total Openings',
            color_discrete_sequence= px.colors.qualitative.Safe, width=900 , height=600, 
             title='Job Titles With the Most Openings')
fig.update_traces(textposition='outside')
fig.show()

It looks like the most common Job Titles for Data Analyst positions are 'Data Analyst','Senior Data Analyst', and 'Junior Data Analyst' - so pretty general job titles rather than the more specific/fancier titles.

Positions by Industry and Sector

In [13]:
#creating dataframe for total positions by industry 
count = df['Industry'].value_counts()
industry_count = pd.DataFrame(data=count).reset_index()
industry_count.rename(columns={'index':'Industry', 'Industry':'Total'}, inplace=True)
industry_count.head()
Out[13]:
Industry Total
0 IT Services 316
1 Staffing & Outsourcing 316
2 Health Care Services & Hospitals 150
3 Computer Hardware & Software 110
4 Consulting 110
In [14]:
#creating dataframe for total positions by sector 
count_sec = df['Sector'].value_counts()
sector_count = pd.DataFrame(data=count_sec).reset_index()
sector_count.rename(columns={'index':'Sector', 'Sector':'Total'}, inplace=True)
sector_count.head()
Out[14]:
Sector Total
0 Information Technology 557
1 Business Services 508
2 Finance 166
3 Health Care 150
4 Education 51
In [15]:
#I will be using pie charts a lot throughout my analysis so I'm defining a function for it 
def pie_chart(df, x, y):
    fig = px.pie(df, values=x, names=y,
                 title='Data Analyst Positions by '+ str(y),
                 labels=y, height = 600, color=y, color_discrete_sequence= px.colors.qualitative.Safe)
    fig.update_traces(textposition='inside', textinfo='percent+label')
    fig.show()

pie_chart(industry_count, 'Total', 'Industry') 
pie_chart(sector_count, 'Total', 'Sector')

Company Size of Current Openings - What size companies are hiring the most?

In [16]:
size = df['Size'].value_counts()
size_count = pd.DataFrame(data=size).reset_index()
size_count.rename(columns={'index':'Size', 'Size':'Number of Companies'}, inplace=True)

fig = px.bar(size_count, x='Size', y='Number of Companies', color='Size', text='Number of Companies', 
                 color_discrete_sequence= px.colors.qualitative.Safe , 
            width=900 , height=600, title='Openings by Company Size')
fig.update_traces(textposition='outside')
fig.show()

It's interesting to see that smaller companies (51-200 employees) are actually hiring the most Data Analysts, with larger companies (10,000+ employees) coming in second.

Where are the most job openings located?

In [17]:
#creating dataframe for total number of positions by location 
count2 = df['Location'].value_counts()
location_count = pd.DataFrame(data=count2).reset_index()
location_count.rename(columns={'index':'Location', 'Location':'Total'}, inplace=True)

fig = px.choropleth(location_count, locations = 'Location' , locationmode ="USA-states", scope='usa',
                    color='Total', title ="Data Analyst Positions by Location", width=900, 
                    height=600, color_continuous_scale=px.colors.sequential.deep)
fig.show()

As you can clearly see on the map - California, Texas, and New York seem to be hottest locations for Data Analyst openings.

Companies with the most openings - looking at the top 15 with and without Staffing/IT Services Companies included

In [18]:
#companies with the most data analyst openings 
count3 = df['Company Name'].value_counts()
company_count = pd.DataFrame(data=count3).reset_index()
company_count.rename(columns={'index':'Company', 'Company Name':'Total Job Openings'}, inplace=True)
top_10 = company_count.head(10)

#companies with the most data analyst openings not including companies who categorize themselves as staffing 
#agencies or IT Services organizations 
staffing = df.loc[df['Industry'] != 'Staffing & Outsourcing'] 
staffing = staffing.loc[staffing['Industry'] != 'IT Services']
count4 = staffing['Company Name'].value_counts()
minus_staffing = pd.DataFrame(data=count4).reset_index()
minus_staffing.rename(columns={'index':'Company', 'Company Name':'Total Job Openings'}, inplace=True)
minus_staffing_top = minus_staffing.head(10)
In [19]:
from plotly.subplots import make_subplots

fig = make_subplots(subplot_titles=("Including Staffing and IT Services Agencies \n","Not Including Staffing and IT Services Agencies \n"),
                   rows=2, cols=1, vertical_spacing=0.30)
fig.add_trace(go.Bar(x=top_10['Company'], y=top_10['Total Job Openings'], 
            marker=dict(color=['rgb(136, 204, 238)', 'rgb(204, 102, 119)', 'rgb(221, 204, 119)', 
                               'rgb(17, 119, 51)', 'rgb(51, 34, 136)', 'rgb(170, 68, 153)', 
                               'rgb(68, 170, 153)', 'rgb(153, 153, 51)', 'rgb(136, 34, 85)', 
                               'rgb(102, 17, 0)'])), row=1, col=1)
fig.add_trace(go.Bar(x=minus_staffing_top['Company'], y=minus_staffing_top['Total Job Openings'],
               marker=dict(color=['rgb(136, 204, 238)', 'rgb(204, 102, 119)', 'rgb(221, 204, 119)', 
                               'rgb(17, 119, 51)', 'rgb(51, 34, 136)', 'rgb(170, 68, 153)', 
                               'rgb(68, 170, 153)', 'rgb(153, 153, 51)', 'rgb(136, 34, 85)', 
                               'rgb(102, 17, 0)'])), row=2, col=1) 

fig.update_layout(showlegend=False, height = 900, width=800,title_text = 'Top 15 Companies with the Most Openings')
fig.show()

Some people prefer not to work with Staffing/Third-Party agencies, so I wanted to show the companies with the most openings with and without companies that categorize themselves as a Staffing/IT Services organization.

Note - some of the companies for the top 10 not including staffing agencies might still include a staffing agency or two simply because these companies don't actually categorize themselves as one.

Total Openings by Salary Estimates

In [20]:
salary = df['Salary Estimate'].value_counts()
salary_count = pd.DataFrame(data=salary).reset_index()
salary_count.rename(columns={'index':'Salary', 'Salary Estimate':'Total Job Openings'}, inplace=True)
salary_count.sort_values(by='Total Job Openings', ascending=True)
salary_head = salary_count.head(10)

fig = px.bar(salary_head, x='Total Job Openings', y='Salary', orientation='h', color='Salary', height = 500, 
             text='Total Job Openings',width=800, color_discrete_sequence= px.colors.qualitative.Safe)
fig.update_layout(title_text = 'Total Openings by Salary', showlegend=True)
fig.update_traces(textposition='outside')
fig.show()

Suggesting Company Based on Preferences

In [21]:
#playing around with KNN for the first time 
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df_sector = pd.DataFrame(df['Sector'], columns=['Sector'])
df_sector['Sector_Encoded'] = le.fit_transform(df_sector['Sector'])
df_labels = df_sector.drop_duplicates(subset=['Sector'])
df_labels
Out[21]:
Sector Sector_Encoded
0 Non-Profit 16
1 Health Care 10
2 Information Technology 11
4 Arts, Entertainment & Recreation 2
5 Finance 8
7 Insurance 12
9 Business Services 4
23 Restaurants, Bars & Food Services 19
50 Media 14
57 Accounting & Legal 0
59 Real Estate 18
75 Government 9
100 Consumer Services 6
117 Biotech & Pharmaceuticals 3
141 Education 7
202 Construction, Repair & Maintenance 5
311 Oil, Gas, Energy & Utilities 17
334 Retail 20
399 Manufacturing 13
457 Aerospace & Defense 1
482 Telecommunications 21
730 Transportation & Logistics 22
1090 Mining & Metals 15
2171 Travel & Tourism 23
In [22]:
features=list(zip(df_sector['Sector_Encoded'], df['Rating']))
In [23]:
from sklearn.neighbors import KNeighborsClassifier

#setting KNN=the sqrt of 1560 and setting the training sets
model = KNeighborsClassifier(n_neighbors=39)
model.fit(features, df['Company Name'])
Out[23]:
KNeighborsClassifier(n_neighbors=39)

Let's say you want a job in the Education sector at a company with a 4.0 rating. What company has openings that you should take a look at?

In [24]:
#if you look at df_labels you can see Education is encoded as 7 so we would use that number and 4.0 for rating
trial = model.predict([[7, 4.0]])
print(trial)
['UC San Diego']

Now let's say someone else is looking for a Non-Profit position at a top rated company. What company has openings that they should take a look at?

In [25]:
trial2 = model.predict([[5, 5.0]])
print(trial2)
['Orbis, Inc.']